反事实说明代表了对数据样本的最小变化,其改变其预测分类,通常是从不利的初始类到所需的目标类别。反事实可以帮助回答问题,例如“需要更改此申请以获得贷款的需要?”。一些最近提出的反事实的方法涉及“合理的”反事实和方法的不同定义。然而,许多这些方法是计算密集的,并提供不符合的解释。在这里,我们介绍了锐利的程序,这是一个用于通过创建分类为目标类的输入的投影版本来启动的二进制分类方法。然后在输入及其投影之间的插值线上的潜在空间中生成反事实候选者。然后,我们展示了我们的框架通过使用学习的陈述将样本的核心特征转化为其反事实。此外,我们表明Strappooter在表格和图像数据集上跨越普通质量指标具有竞争力,同时在现实主义测量中的两个可比方法和擅长的级别,使其适用于需要及时解释的高速机器学习应用。
translated by 谷歌翻译
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distributed datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
translated by 谷歌翻译
Scoring the factuality of a generated summary involves measuring the degree to which a target text contains factual information using the input document as support. Given the similarities in the problem formulation, previous work has shown that Natural Language Inference models can be effectively repurposed to perform this task. As these models are trained to score entailment at a sentence level, several recent studies have shown that decomposing either the input document or the summary into sentences helps with factuality scoring. But is fine-grained decomposition always a winning strategy? In this paper we systematically compare different granularities of decomposition -- from document to sub-sentence level, and we show that the answer is no. Our results show that incorporating additional context can yield improvement, but that this does not necessarily apply to all datasets. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance, highlighting the need for caution in model and methodology selection for downstream tasks.
translated by 谷歌翻译
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we: - announce the first public release of the astroBERT language model; - show how astroBERT improves over existing public language models on astrophysics specific tasks; - and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
translated by 谷歌翻译
Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.
translated by 谷歌翻译
2型糖尿病(T2DM)的早期诊断对于及时的治疗干预措施和生活方式改变至关重要。随着医学成像数据在许多患者群体中变得更广泛可用,我们试图研究是否可以在表格学习分类器模型中利用图像衍生的表型数据来预测T2DM的发病率,而无需使用侵入性血液实验室测量。我们表明,使用图像衍生表型的神经网络和决策树模型都可以预测患者T2DM状态的召回评分高达87.6%。我们还提出了与“ Syntha1c编码器”相同的结构的新颖使用,这些结构能够输出模仿血液血红蛋白A1C经验实验室测量值的可解释值。最后,我们证明了T2DM风险预测模型对输入矢量成分中小扰动的敏感性可用于预测从以前看不见的患者人群中取样的协变量的性能。
translated by 谷歌翻译
我们介绍了一项对自然语言(NL)推理的人类通知,开放域和逻辑上复杂且多样的数据集,配备了一阶逻辑(fol)注释。对开本由1,435个示例(独特的结论)组成,每个示例与487组前提之一搭配,这些场所作为规则,可用于演绎理由,以理解每个结论的有效性。前提和结论的逻辑正确性是通过其平行注释来确保的,这些注释会自动由我们的FOL推理引擎验证。除了主要的NL推理任务外,对开本中的NL-FOL对自动构成了使用FOL作为逻辑形式的新的NL-FOL翻译数据集。我们对广泛的实验系统地评估了对中型语言模型(BERT,ROBERTA)进行微调的FOL推理能力,并且在大型语言模型(GPT-NEOX,OPT,OPT,GPT-3,Codex)上促成了很少的射击。对于NL-FOL翻译,我们尝试使用GPT-3和Codex。我们的结果表明,公开可用的最强大的大语言模型之一(LLM),GPT-3 Davinci,仅比随机结果略好,而在一部分集的一部分中,该模型尤其不好,并且在预测该模型方面尤其不好。纠正虚假和未知结论的真实价值。我们的数据集和代码可在https://github.com/yale-lily/folio上找到。
translated by 谷歌翻译
当从人类行为中推断出奖励功能(无论是演示,比较,物理校正或电子停靠点)时,它已证明对人类进行建模作为做出嘈杂的理性选择,并具有“合理性系数”,以捕获多少噪声或熵我们希望看到人类的行为。无论人类反馈的类型或质量如何,许多现有作品都选择修复此系数。但是,在某些情况下,进行演示可能要比回答比较查询要困难得多。在这种情况下,我们应该期望在示范中看到比比较中更多的噪音或次级临时性,并且应该相应地解释反馈。在这项工作中,我们提倡,将每种反馈类型的实际数据中的理性系数扎根,而不是假设默认值,对奖励学习具有重大的积极影响。我们在模拟反馈以及用户研究的实验中测试了这一点。我们发现,从单一反馈类型中学习时,高估人类理性可能会对奖励准确性和遗憾产生可怕的影响。此外,我们发现合理性层面会影响每种反馈类型的信息性:令人惊讶的是,示威并不总是最有用的信息 - 当人类的行为非常卑鄙时,即使在合理性水平相同的情况下,比较实际上就变得更加有用。 。此外,当机器人确定要要求的反馈类型时,它可以通过准确建模每种类型的理性水平来获得很大的优势。最终,我们的结果强调了关注假定理性级别的重要性,不仅是在从单个反馈类型中学习时,尤其是当代理商从多种反馈类型中学习时,尤其是在学习时。
translated by 谷歌翻译
机器人完成任务的能力在很大程度上取决于其物理设计。但是,确定最佳的物理设计及其相应的控制策略本质上是具有挑战性的。选择链接的数量,类型以及如何在组合设计空间中结果产生的自由,以及对该空间中任何设计的评估都需要得出其最佳控制器。在这项工作中,我们提出了N-LIMB,这是一种在大量形态上优化机器人设计和控制的有效方法。我们框架的核心是一种通用设计条件的控制策略,能够控制各种设计集。这项政策通过允许在设计中转移经验并降低评估新设计的成本,从而大大提高了我们方法的样本效率。我们训练这项政策,以最大程度地提高预期回报,而在设计的分布中,该政策同时更新为普遍政策下的高性能设计。通过这种方式,我们的方法收敛于设计分布,围绕高性能设计和控制器的控制器有效地进行了微调。我们在各种地形的一系列运动任务上展示了我们方法的潜力,并展示了发现小说和高性能的设计控制对。
translated by 谷歌翻译
随着各种科学领域中数据的越来越多,生成模型在科学方法的每个步骤中都具有巨大的潜力来加速科学发现。他们最有价值的应用也许在于传统上提出假设最慢,最具挑战性的步骤。现在,正在从大量数据中学到强大的表示形式,以产生新的假设,这对从材料设计到药物发现的科学发现应用产生了重大影响。 GT4SD(https://github.com/gt4sd/gt4sd-core)是一个可扩展的开放源库,使科学家,开发人员和研究人员能够培训和使用科学发现中假设生成的最先进的生成模型。 GT4SD支持跨材料科学和药物发现的各种生成模型的用途,包括基于与目标蛋白,OMIC剖面,脚手架距离,结合能等性质的分子发现和设计。
translated by 谷歌翻译